
TODO: Need to check accuracy, precision and F-measure

In this activity you will use your knowledge of classification to predict which candidates to the Chamber of Deputies will be elected in the 2014 elections. Specifically, we will do the following:

1 Is there class imbalance (i.e., does one class have many more instances than the other)? In what proportion? What side effects can class imbalance cause in the classifier? How could you handle it? (10 pts.)
2 Train: a KNN model, a logistic regression, a decision tree, and an AdaBoost model. Tune these models using cross-validation and control overfitting if necessary, considering the particularities of each model. (20 pts.)
3 Report precision, recall and F-measure on training and validation. Is there a large performance gap between training and validation? How do you assess the results? Justify your answer. (10 pts.)
4 Interpret the models' outputs. Which attributes seem most important according to each model? (20 pts.)
5 Submit your best models to the Kaggle competition. Make at least one submission. Suggestions to improve the model: (20 pts.) >> 1 Try other models (e.g. SVM, RandomForests and GradientBoosting). >> 2 Try balancing the classes, if they are unbalanced. >> 3 Try other ensemble strategies (e.g. Stacking)

The data are available at: https://www.kaggle.com/c/ufcg-cdp-20182-lab3/data

For the deliverable, submit the RPubs link and the .Rmd files with the R code. For the answers we expect textual explanations and visualizations for each question.

Setting up workspace

setwd("~/git/data-analysis/lab03/")

Loading data

Our data frame will come from the train.csv file, on which we'll build the prediction models; test.csv will be used for the Kaggle challenge.

Here we look at the correlation between the variables and remove the ones that are strongly correlated with each other, since keeping both would be redundant for our prediction model.

data.correlation %>% 
  select(-partido,
         -uf,-grau,-sexo) %>%
  na.omit() %>%
  ggcorr(palette = "RdBu",
         color = "grey50",
         label = TRUE, hjust = 1,
         label_size = 3, size = 4,
         nbreaks = 5, layout.exp = 7) +
  ggtitle("Gráfico de correlação eleições 2006")
data in column(s) 'ocupacao', 'situacao' are not numeric and were ignored

We chose to remove those three categorical variables so the model runs in reasonable time; keeping them in the data could give better results. We also remove the variables that are strongly correlated with each other.

test.kaggle <- test.kaggle %>%
  select(-cargo, -nome, -ocupacao, -sexo, -total_despesa, -total_receita)

It would be better to replace the NAs with the column mean, but we chose to replace them with zero.

test.kaggle[is.na(test.kaggle)] <- 0

Since our target is to predict the variable situacao, we need to check whether the data is balanced. What is the class distribution?

Clearly unbalanced! So what should we do? We will balance it. There are several ways to balance data:
>> 1. Undersampling: reduces the number of observations from the majority class to balance the data set.
>> 2. Oversampling: increases the number of observations from the minority class to balance it.
>> 3. Both: combines techniques 1 and 2 to balance the data set.
>> 4. ROSE: generates synthetic data, providing a better estimate of the original data.

Before balancing, let's run an experiment: build a model on the unbalanced data and record its accuracy so we can compare later.

To build our models we need training and test data, so we split the original data: 70% for training and 30% for testing.

Decision tree with unbalanced data

accuracy.meas(unbalanced.test$situacao, pred.treeimb[,2])

Call: 
accuracy.meas(response = unbalanced.test$situacao, predicted = pred.treeimb[, 
    2])

Examples are labelled as positive when predicted is greater than 0.5 

precision: 0.955
recall: 0.949
F: 0.476
roc.curve(unbalanced.test$situacao, pred.treeimb[,2], plotit = F)
Area under the curve (AUC): 0.890

Surprisingly, we get good precision and recall (though the F-measure is low). Anyway, let's see how it goes with balanced data.

Let's balance all the data using method 4, ROSE sampling, which generates synthetic data.

table(data.rose$situacao)

nao_eleito     eleito 
      3842       3780 

YEAH!

It looks pretty balanced now. Great, so now we train some models and evaluate their metrics.

We now need to partition the balanced data, using the same scheme as before.

2 Train: a KNN model, a logistic regression, a decision tree, and an AdaBoost model. Tune these models using cross-validation and control overfitting if necessary, considering the particularities of each model. (20 pts.)

KNN

The first model is KNN.

k-nearest neighbour classification for test set from training set. For each row of the test set, the k nearest (in Euclidean distance) training set vectors are found, and the classification is decided by majority vote, with ties broken at random. If there are ties for the kth nearest vector, all candidates are included in the vote.

model.knn
k-Nearest Neighbors 

5336 samples
  17 predictors
   2 classes: 'nao_eleito', 'eleito' 

Pre-processing: centered (30), scaled (30), remove (49) 
Resampling: Cross-Validated (10 fold, repeated 10 times) 
Summary of sample sizes: 4802, 4803, 4803, 4802, 4802, 4802, ... 
Resampling results across tuning parameters:

  k  Accuracy   Kappa    
  5  0.8001493  0.5992494
  7  0.7857002  0.5702039
  9  0.7824398  0.5636275

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was k = 5.
knn_cv
Accuracy    Kappa 
  0.7979   0.5947 

Logistic Regression

The second model to be built. It fits a regression curve y = f(x) where y is a categorical variable; here we use caret's boosted variant, 'LogitBoost'.

model.logistic_reg
Boosted Logistic Regression 

5336 samples
  17 predictors
   2 classes: 'nao_eleito', 'eleito' 

Pre-processing: centered (30), scaled (30), remove (49) 
Resampling: Cross-Validated (10 fold, repeated 10 times) 
Summary of sample sizes: 4803, 4802, 4803, 4803, 4802, 4802, ... 
Resampling results across tuning parameters:

  nIter  Accuracy   Kappa    
  11     0.9311837  0.8623562
  21     0.9626129  0.9252215
  31     0.9642806  0.9285550

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was nIter = 31.
logistic_reg_cv
Accuracy    Kappa 
  0.9637   0.9274 

Decision Tree

The third model. A decision tree is a graph that represents choices and their outcomes in the form of a tree: the nodes represent an event or choice and the edges represent decision rules or conditions.

new_index <- createDataPartition(data.rose$situacao, p = 0.7, list = FALSE)
new_train_data <- data.rose[new_index, ]
new_test_data  <- data.rose[-new_index, ]


new_treeimb <- rpart(situacao ~ ., data = new_train_data)
new_pred.treeimb <- predict(new_treeimb, newdata = new_test_data)


accuracy.meas(new_test_data$situacao, new_pred.treeimb[,2])
model.tree_dec
CART 

5336 samples
  17 predictors
   2 classes: 'nao_eleito', 'eleito' 

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 10 times) 
Summary of sample sizes: 4802, 4802, 4802, 4803, 4802, 4803, ... 
Resampling results across tuning parameters:

  cp          Accuracy   Kappa    
  0.05442177  0.8897316  0.7794148
  0.15873016  0.8324575  0.6642744
  0.59372638  0.6332076  0.2614189

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was cp = 0.05442177.
tree_cv
Accuracy    Kappa 
  0.8648   0.7295 

AdaBoost

Boosting is an ensemble technique that attempts to create a strong classifier from a number of weak classifiers.



model.adaboost <- train(situacao ~ media_receita + media_despesa, 
               data = train,
               trControl = fitControl,
               method = 'adaboost', 
               metric = "Accuracy",
               preProcess = preProcess)

model.adaboost
adaboost_prediction <- predict(model.adaboost, test)

adaboost_data <- data.frame(pred = adaboost_prediction, obs = test$situacao)

adaboost_cv <- round(defaultSummary(adaboost_data), digits = 4)

adaboost_cv

Which attributes are most important to each model

4 Interpret the models' outputs. Which attributes seem most important according to each model? (20 pts.)

KNN

varImp(model.knn)
ROC curve variable importance

                                      Importance
recursos_de_pessoas_juridicas            100.000
recursos_de_pessoas_fisicas               91.044
media_receita                             81.484
quantidade_fornecedores                   74.186
quantidade_despesas                       73.979
quantidade_doadores                       47.197
quantidade_doacoes                        47.047
media_despesa                             46.762
recursos_de_partido_politico              44.599
recursos_de_outros_candidatos.comites     40.221
recursos_proprios                         36.430
grau                                      28.731
uf                                        26.858
partido                                   23.780
estado_civil                              22.325
sexo                                       5.491
ano                                        0.000

Logistic Regression

varImp(model.logistic_reg)
ROC curve variable importance

                                      Importance
recursos_de_pessoas_juridicas            100.000
recursos_de_pessoas_fisicas               91.044
media_receita                             81.484
quantidade_fornecedores                   74.186
quantidade_despesas                       73.979
quantidade_doadores                       47.197
quantidade_doacoes                        47.047
media_despesa                             46.762
recursos_de_partido_politico              44.599
recursos_de_outros_candidatos.comites     40.221
recursos_proprios                         36.430
grau                                      28.731
uf                                        26.858
partido                                   23.780
estado_civil                              22.325
sexo                                       5.491
ano                                        0.000

Decision Tree

varImp(model.tree_dec)
rpart variable importance

  only 20 most important variables shown (out of 79)

                                      Overall
recursos_de_pessoas_fisicas            100.00
quantidade_doacoes                      89.12
recursos_de_pessoas_juridicas           78.10
quantidade_fornecedores                 61.01
quantidade_despesas                     58.30
quantidade_doadores                     32.17
recursos_de_partido_politico            31.33
recursos_de_outros_candidatos.comites   30.86
ufAP                                     0.00
ufMT                                     0.00
ano                                      0.00
ufSP                                     0.00
partidoPCO                               0.00
ufGO                                     0.00
ufAL                                     0.00
ufCE                                     0.00
`partidoPT do B`                         0.00
ufRR                                     0.00
`estado_civilDIVORCIADO(A)`              0.00
partidoPR                                0.00

AdaBoost

varImp(model.adaboost)

Kaggle challenge

As we can see, ano and sexo have very low importance, so those variables could be removed.

As proposed in the activity, we use our best model to submit the predictions to the Kaggle challenge.

5 Submit your best models to the Kaggle competition. Make at least one submission. Suggestions to improve the model: (20 pts.) >> 1 Try other models (e.g. SVM, RandomForests and GradientBoosting). >> 2 Try balancing the classes, if they are unbalanced. >> 3 Try other ensemble strategies (e.g. Stacking)

TODO: This needs to be revised.

prediction_ <- predict(model.logistic_reg, test.kaggle)

Useful links: http://www.treselle.com/blog/handle-class-imbalance-data-with-r/ https://www.analyticsvidhya.com/blog/2016/03/practical-guide-deal-imbalanced-classification-problems/ https://shiring.github.io/machine_learning/2017/04/02/unbalanced

---
title: "R Notebook"
output: html_notebook
---

```{r, eval=FALSE}
# run these once by hand; eval=FALSE keeps knitting from reinstalling loaded packages
install.packages("mlbench")
install.packages("C50")
install.packages("magrittr")
install.packages("ROSE")
install.packages("rpart")
```

```{r}
library(caret)
library(mlbench)
library(C50)
library(dplyr)
library(plotly)
library(ROSE)
library(rpart)
library(GGally)
```
#TODO: Need to check accuracy, precision and F-measure

In this activity you will use your knowledge of classification to predict which candidates to the Chamber of Deputies will be elected in the 2014 elections. Specifically, we will do the following:

>> 1 Is there class imbalance (i.e., does one class have many more instances than the other)? In what proportion? What side effects can class imbalance cause in the classifier? How could you handle it? (10 pts.)
>> 2 Train: a KNN model, a logistic regression, a decision tree, and an AdaBoost model. Tune these models using cross-validation and control overfitting if necessary, considering the particularities of each model. (20 pts.)
>> 3 Report precision, recall and F-measure on training and validation. Is there a large performance gap between training and validation? How do you assess the results? Justify your answer. (10 pts.)
>> 4 Interpret the models' outputs. Which attributes seem most important according to each model? (20 pts.)
>> 5 Submit your best models to the Kaggle competition. Make at least one submission. Suggestions to improve the model: (20 pts.)
>>>> 1 Try other models (e.g. SVM, RandomForests and GradientBoosting).
>>>> 2 Try balancing the classes, if they are unbalanced.
>>>> 3 Try other ensemble strategies (e.g. Stacking)

The data are available at: https://www.kaggle.com/c/ufcg-cdp-20182-lab3/data

For the deliverable, submit the RPubs link and the .Rmd files with the R code. For the answers we expect textual explanations and visualizations for each question.

Setting up workspace
```{r}
setwd("~/git/data-analysis/lab03/")
```


Loading data

Our data frame will come from the train.csv file, on which we'll build the prediction models;
test.csv will be used for the Kaggle challenge.

```{r}
data <- read.csv("data/all/train.csv")
test.kaggle <- read.csv("data/all/test.csv")
```

Here we look at the correlation between the variables and remove the ones that are strongly correlated with each other, since keeping both would be redundant for our prediction model.

```{r}
data.correlation1 <- data %>% select(-c(sequencial_candidato, nome, estado_civil, ano, cargo))

data.correlation <- data.correlation1  %>%
  mutate(situacao = as.factor(situacao)) %>%
  mutate(uf = as.factor(uf)) %>%
  mutate(partido = as.factor(partido)) %>%
  mutate(sexo = as.factor(sexo)) %>%
  mutate(grau = as.factor(grau)) %>%
  mutate(ocupacao = as.factor(ocupacao))

data.correlation %>% 
  select(-partido,
         -uf,-grau,-sexo) %>%
  na.omit() %>%
  ggcorr(palette = "RdBu",
         color = "grey50",
         label = TRUE, hjust = 1,
         label_size = 3, size = 4,
         nbreaks = 5, layout.exp = 7) +
  ggtitle("Gráfico de correlação eleições 2006")
```


We chose to remove those three categorical variables so the model runs in reasonable time; keeping them in the data could give better results. We also remove the variables that are strongly correlated with each other.

```{r}
data <- data %>%
  select(-cargo, -nome, -ocupacao, -sexo, -total_despesa, -total_receita, -sequencial_candidato )
test.kaggle <- test.kaggle %>%
  select(-cargo, -nome, -ocupacao, -sexo, -total_despesa, -total_receita) # keep sequencial_candidato: it is the submission ID
```

It would be better to replace the NAs with the column mean, but we chose to replace them with zero.


```{r}
data[is.na(data)] <- 0
test.kaggle[is.na(test.kaggle)] <- 0
```
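The mean-imputation alternative mentioned above can be sketched as follows (a sketch, assuming dplyr >= 1.0 for `across()`/`where()`; `impute_mean` is a hypothetical helper, not part of the original analysis):

```{r}
# Sketch: replace NAs in numeric columns with the column mean instead of zero
library(dplyr)

impute_mean <- function(df) {
  df %>%
    mutate(across(where(is.numeric), ~ ifelse(is.na(.x), mean(.x, na.rm = TRUE), .x)))
}
# e.g. data <- impute_mean(data)  # we kept the zero-fill above for simplicity
```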

Since our target is to predict the variable *situacao*, we need to check whether the data is balanced. What is the class distribution?


```{r}
data_class_distribution <- data %>% group_by(situacao) %>% summarize(class_count = n())
p <- plot_ly(data_class_distribution, x = ~situacao, y = ~class_count, type = 'bar',
        marker = list(color = c('rgba(204,204,204,1)', 'rgba(222,45,38,0.8)'))) %>%
  layout(title = "Class Balance",
         xaxis = list(title = "Situation"),
         yaxis = list(title = "Count"))
p
```

Clearly unbalanced!
So what should we do? We will balance it.
There are several ways to balance data:
>> 1. Undersampling
This method reduces the number of observations from the majority class to balance the data set.
>> 2. Oversampling
This method increases the number of observations from the minority class to balance it.
>> 3. Both
This combines techniques 1 and 2 to balance the data set.
>> 4. ROSE
This generates synthetic data, providing a better estimate of the original data.
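The first three techniques can be sketched with ROSE's `ovun.sample` (left commented as an illustration; `p = 0.5` targets a roughly 50/50 split, and the exact values are assumptions):

```{r}
# Illustrative sketches of techniques 1-3; we use technique 4 (ROSE) below
# under <- ovun.sample(situacao ~ ., data = data, method = "under", p = 0.5, seed = 1)$data
# over  <- ovun.sample(situacao ~ ., data = data, method = "over",  p = 0.5, seed = 1)$data
# both  <- ovun.sample(situacao ~ ., data = data, method = "both",  p = 0.5, seed = 1)$data
# table(both$situacao)
```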




Before balancing, let's run an experiment: build a model on the unbalanced data and record its accuracy so we can compare later.



To build our models we need training and test data, so we split the original data: 70% for training and 30% for testing.

```{r}
set.seed(42)
index <- createDataPartition(data$situacao, p = 0.7, list = FALSE)
unbalanced.train <- data[index, ]
unbalanced.test <- data[-index, ]
```


Decision tree with unbalanced data

```{r}

treeimb <- rpart(situacao ~ ., data = unbalanced.train)
pred.treeimb <- predict(treeimb, newdata = unbalanced.test)

accuracy.meas(unbalanced.test$situacao, pred.treeimb[,2])
```
```{r}
roc.curve(unbalanced.test$situacao, pred.treeimb[,2], plotit = F)
```

Surprisingly, we get good precision and recall (though the F-measure is low). Anyway, let's see how it goes with balanced data.



Let's balance all the data using method 4, ROSE sampling, which generates synthetic data.


```{r}
data.rose <- ROSE(situacao ~ ., data = data, seed = 1)$data
table(data.rose$situacao)
```
YEAH!

It looks pretty balanced now. Great, so now we train some models and evaluate their metrics.

We now need to partition the balanced data, using the same scheme as before.

```{r}
set.seed(42)
index <- createDataPartition(data.rose$situacao, p = 0.7, list = FALSE)
train <- data.rose[index, ]
test <- data.rose[-index, ]
```

>> 2 Train: a KNN model, a logistic regression, a decision tree, and an AdaBoost model. Tune these models using cross-validation and control overfitting if necessary, considering the particularities of each model. (20 pts.)

#KNN
The first model is KNN.

k-nearest neighbour classification for test set from training set. For each row of the test set, the k nearest (in Euclidean distance) training set vectors are found, and the classification is decided by majority vote, with ties broken at random. If there are ties for the kth nearest vector, all candidates are included in the vote.
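The majority-vote mechanics can be seen on a toy example with `class::knn` (independent of the election data; the points here are made up):

```{r}
# Toy k-NN: three points of class "a" near the origin, two of class "b" far away
library(class)
train_x <- data.frame(x = c(0, 0.1, 0.2, 5, 5.1), y = c(0, 0.1, 0.2, 5, 5.1))
train_y <- factor(c("a", "a", "a", "b", "b"))
# the 3 nearest neighbours of (0.15, 0.15) are all class "a", so the vote returns "a"
knn(train = train_x, test = data.frame(x = 0.15, y = 0.15), cl = train_y, k = 3)
```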


```{r}
fitControl <- trainControl(method = "repeatedcv", 
                           number = 10,
                           repeats = 10)

preProcess = c("center", "scale","nzv" )
```

```{r}
model.knn <- train(situacao ~ ., 
               data = train,
               trControl = fitControl,
               method = "knn",
               metric = "Accuracy",
               preProcess = preProcess)

model.knn
```


```{r}
knn_prediction <- predict(model.knn,test)

knn_data <- data.frame(pred = knn_prediction, obs = test$situacao)

knn_cv <- round(defaultSummary(knn_data),digits = 4)

knn_cv
```

#Logistic Regression
The second model to be built.
It fits a regression curve y = f(x) where y is a categorical variable; here we use caret's boosted variant, 'LogitBoost'.
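For reference, a plain (non-boosted) logistic regression could be fit with base R's `glm`; this is only a sketch, left commented, since we fit caret's 'LogitBoost' with cross-validation instead:

```{r}
# Plain logistic regression sketch (assumes the `train` split created above)
# fit <- glm(situacao ~ ., data = train, family = binomial)
# summary(fit)  # coefficients are on the log-odds scale
```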


```{r}
model.logistic_reg <- train(situacao ~ ., 
               data = train,
               trControl = fitControl,
               method = 'LogitBoost', 
               metric = "Accuracy",
               preProcess = preProcess)

model.logistic_reg
```



```{r}
logistic_reg_prediction <- predict(model.logistic_reg,test)

logistic_reg_data <- data.frame(pred = logistic_reg_prediction, obs = test$situacao)

logistic_reg_cv <- round(defaultSummary(logistic_reg_data),digits = 4)

logistic_reg_cv

```

  
#Decision Tree
The third model.
A decision tree is a graph that represents choices and their outcomes in the form of a tree: the nodes represent an event or choice and the edges represent decision rules or conditions.

```{r}
new_index <- createDataPartition(data.rose$situacao, p = 0.7, list = FALSE)
new_train_data <- data.rose[new_index, ]
new_test_data  <- data.rose[-new_index, ]


new_treeimb <- rpart(situacao ~ ., data = new_train_data)
new_pred.treeimb <- predict(new_treeimb, newdata = new_test_data)


accuracy.meas(new_test_data$situacao, new_pred.treeimb[,2])
```

```{r}
model.tree_dec <- train(situacao ~ .,
                data= train, 
                method = "rpart",
                trControl = fitControl,
                cp=0.001,  
                metric = "Accuracy",
                maxdepth=20)
model.tree_dec
```

```{r}
tree_prediction <- predict(model.tree_dec,test)

tree_data <- data.frame(pred = tree_prediction, obs = test$situacao)

tree_cv <- round(defaultSummary(tree_data),digits = 4)

tree_cv

```


#AdaBoost
Boosting is an ensemble technique that attempts to create a strong classifier from a number of weak classifiers.

```{r}


model.adaboost <- train(situacao ~ media_receita + media_despesa, 
               data = train, # train only on the training split so `test` stays held out
               trControl = fitControl,
               method = 'adaboost', 
               metric = "Accuracy",
               preProcess = preProcess)

model.adaboost


```


```{r}
adaboost_prediction <- predict(model.adaboost, test)

adaboost_data <- data.frame(pred = adaboost_prediction, obs = test$situacao)

adaboost_cv <- round(defaultSummary(adaboost_data), digits = 4)

adaboost_cv

```
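Question 3 asks for precision, recall and F-measure; caret's `confusionMatrix` with `mode = "prec_recall"` reports them directly. A sketch using the KNN predictions computed above (taking 'eleito' as the positive class, an assumption):

```{r}
# Precision, recall and F1 for the KNN model on the test split
confusionMatrix(data = knn_prediction, reference = test$situacao,
                positive = "eleito", mode = "prec_recall")
```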

##Which attributes are most important to each model

>> 4 Interpret the models' outputs. Which attributes seem most important according to each model? (20 pts.)

###KNN
```{r}
varImp(model.knn)
```


###Logistic Regression

```{r}
varImp(model.logistic_reg)
```

###Decision Tree

```{r}
varImp(model.tree_dec)
```

###AdaBoost

```{r}
varImp(model.adaboost)
```


## Kaggle challenge
As we can see, *ano* and *sexo* have very low importance, so those variables could be removed.

As proposed in the activity, we use our best model to submit the predictions to the Kaggle challenge.

>> 5 Submit your best models to the Kaggle competition. Make at least one submission. Suggestions to improve the model: (20 pts.)
>>>> 1 Try other models (e.g. SVM, RandomForests and GradientBoosting).
>>>> 2 Try balancing the classes, if they are unbalanced.
>>>> 3 Try other ensemble strategies (e.g. Stacking)

#TODO: This needs to be revised




```{r ,warning=FALSE, message=FALSE}
prediction_ <- predict(model.logistic_reg, test.kaggle)
ID <- test.kaggle %>%
  select(sequencial_candidato)
colnames(ID)[colnames(ID) == "sequencial_candidato"] <- "ID"
predicted_file <- ID
predicted_file$situacao <- prediction_
write.csv(predicted_file, "sample_submission.csv", row.names=FALSE)
```







Useful links:
http://www.treselle.com/blog/handle-class-imbalance-data-with-r/
https://www.analyticsvidhya.com/blog/2016/03/practical-guide-deal-imbalanced-classification-problems/
https://shiring.github.io/machine_learning/2017/04/02/unbalanced 



